Cross-Modal Learning for Sketch Visual Understanding.
PhD Thesis. As touch devices have rapidly proliferated, sketch has gained popularity as an alternative input to text descriptions and speech. Sketch has the advantage of being both informative and convenient, which has stimulated sketch-related research in areas such as sketch recognition, sketch segmentation, sketch-based image retrieval, and photo-to-sketch synthesis. Although these fields have been well studied, existing sketch works still struggle to align the sketch and photo domains, resulting in unsatisfactory quality for both fine-grained retrieval and synthesis between the sketch and photo modalities. To address these problems, this thesis proposes a series of novel works on free-hand sketch related tasks and offers insights to guide future research.
Sketch conveys fine-grained information, making fine-grained sketch-based image retrieval (FG-SBIR) one of the most important topics in sketch research. The basic solution to this task is to learn to exploit the informativeness of sketches and link it to other modalities. Beyond informativeness, semantic information is also important for understanding the sketch modality and linking it with other related modalities. In this thesis, we show that semantic information can effectively bridge the domain gap between the sketch and photo modalities. Based on this observation, we propose an attribute-aware deep framework that exploits attribute information to aid fine-grained SBIR. Text descriptions are considered another semantic alternative to attributes, with the advantage of being more flexible and natural, and are exploited in our proposed deep multi-task framework. The experimental study shows that semantic attribute information improves fine-grained SBIR performance by a large margin.
Sketch also has unique characteristics, such as containing temporal information. The sketch synthesis task requires understanding both the semantic meaning behind sketches and the sketching process. The semantic meaning of sketches has been well explored in sketch recognition and sketch retrieval challenges. However, the sketching process has largely been ignored, even though it is also very important for understanding the sketch modality, especially considering the unique temporal characteristics of sketches. In this thesis, we propose the first deep photo-to-sketch synthesis framework, which achieves good performance on the sketch synthesis task, as shown in the experiment section.
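Although the abstract gives no implementation detail here, the temporal view of sketching can be illustrated with a minimal, hypothetical PyTorch decoder that turns a photo embedding into a sequence of pen moves; the architecture, dimensions, and the (dx, dy, pen-lift) parameterisation are assumptions for illustration, not the framework proposed in the thesis.

```python
import torch
import torch.nn as nn

class PhotoToStrokeDecoder(nn.Module):
    """Toy illustration: decode a photo embedding into a temporal sequence of
    pen states (dx, dy, pen-lift), reflecting the sequential nature of sketching."""

    def __init__(self, photo_dim=512, hidden=256, steps=50):
        super().__init__()
        self.init_h = nn.Linear(photo_dim, hidden)
        self.rnn = nn.GRUCell(3, hidden)            # input: previous (dx, dy, pen) move
        self.out = nn.Linear(hidden, 3)
        self.steps = steps

    def forward(self, photo_feat):
        h = torch.tanh(self.init_h(photo_feat))     # (B, hidden) initial state from the photo
        point = photo_feat.new_zeros(photo_feat.size(0), 3)
        strokes = []
        for _ in range(self.steps):                 # one pen move per time step
            h = self.rnn(point, h)
            point = self.out(h)
            strokes.append(point)
        return torch.stack(strokes, dim=1)          # (B, steps, 3) stroke sequence

decoder = PhotoToStrokeDecoder()
seq = decoder(torch.randn(4, 512))                  # (4, 50, 3)
```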
Generalisability is an important criterion for judging whether existing methods can be applied to real-world scenarios, especially considering the difficulty and cost of collecting sketches and pairwise annotations. We therefore propose a generalised fine-grained SBIR framework. In detail, we follow a meta-learning strategy and train a hyper-network to generate instance-level classification weights for the subsequent matching network. The effectiveness of the proposed method is validated by extensive experimental results.
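To make the hyper-network idea more concrete, the following is a minimal PyTorch sketch in which a query sketch embedding is mapped to instance-level classification weights that score candidate photo embeddings within a training episode; the module names, dimensions, and loss are illustrative assumptions rather than the thesis implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperFGSBIR(nn.Module):
    """Illustrative sketch: a hyper-network maps a query sketch embedding to
    instance-level classification weights used to score candidate photos."""

    def __init__(self, feat_dim=512):
        super().__init__()
        # Shared encoder stand-in; in practice this would be a CNN backbone.
        self.encoder = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())
        # Hyper-network: sketch embedding -> generated classifier weight vector.
        self.hyper = nn.Linear(feat_dim, feat_dim)

    def forward(self, sketch_feat, photo_feats):
        # sketch_feat: (B, D) query sketches; photo_feats: (B, N, D) candidate photos.
        q = self.encoder(sketch_feat)
        w = F.normalize(self.hyper(q), dim=-1)          # (B, D) instance-level weights
        p = F.normalize(self.encoder(photo_feats), dim=-1)
        # Each generated classifier scores its own episode of candidates.
        return torch.einsum('bd,bnd->bn', w, p)

# Episodic (meta-learning style) training step with cross-entropy over candidates.
model = HyperFGSBIR()
sketches, photos = torch.randn(8, 512), torch.randn(8, 20, 512)
target = torch.zeros(8, dtype=torch.long)               # assume the true photo is index 0
loss = F.cross_entropy(model(sketches, photos), target)
loss.backward()
```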
Deep Spatial-Semantic Attention for Fine-Grained Sketch-Based Image Retrieval
Human sketches are unique in being able to capture both the spatial topology of a visual object and its subtle appearance details. Fine-grained sketch-based image retrieval (FG-SBIR) leverages such fine-grained characteristics of sketches to conduct instance-level retrieval of photos. Nevertheless, human sketches are often highly abstract and iconic, resulting in severe misalignments with candidate photos which in turn make subtle visual detail matching difficult. Existing FG-SBIR approaches focus only on coarse holistic matching via deep cross-domain representation learning, yet ignore explicitly accounting for fine-grained details and their spatial context. In this paper, a novel deep FG-SBIR model is proposed which differs significantly from existing models in that: (1) it is spatially aware, achieved by introducing an attention module that is sensitive to the spatial position of visual details; (2) it combines coarse and fine semantic information via a shortcut connection fusion block; and (3) it models feature correlation and is robust to misalignments between the extracted features across the two domains by introducing a novel higher-order learnable energy function (HOLEF) based loss. Extensive experiments show that the proposed deep spatial-semantic attention model significantly outperforms the state-of-the-art.
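As a rough illustration of contribution (3), the snippet below sketches a triplet ranking loss in which the sketch-photo distance is a small learnable energy over squared feature differences instead of a fixed Euclidean metric; this is a simplified reading of the HOLEF idea, and the exact parameterisation, dimensions, and margin are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableEnergyTripletLoss(nn.Module):
    """Simplified stand-in for a HOLEF-style loss: the sketch-photo distance is a
    learnable function of the feature difference rather than a fixed metric."""

    def __init__(self, feat_dim=512, margin=0.3):
        super().__init__()
        self.energy = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 1))
        self.margin = margin

    def dist(self, a, b):
        # Learnable energy over the element-wise squared difference between features.
        return self.energy((a - b) ** 2).squeeze(-1)

    def forward(self, sketch, pos_photo, neg_photo):
        d_pos = self.dist(sketch, pos_photo)
        d_neg = self.dist(sketch, neg_photo)
        return F.relu(d_pos - d_neg + self.margin).mean()

criterion = LearnableEnergyTripletLoss()
s, p, n = torch.randn(16, 512), torch.randn(16, 512), torch.randn(16, 512)
criterion(s, p, n).backward()
```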
Generalizable Person Re-identification by Domain-Invariant Mapping Network
We aim to learn a domain-generalizable person re-identification (ReID) model. When such a model is trained on a set of source domains (ReID datasets collected from different camera networks), it can be directly applied to any new unseen dataset for effective ReID without any model updating. Despite its practical value in real-world deployments, generalizable ReID has seldom been studied. In this work, a novel deep ReID model termed Domain-Invariant Mapping Network (DIMN) is proposed. DIMN is designed to learn a mapping between a person image and its identity classifier, i.e., it produces a classifier using a single shot. To make the model domain-invariant, we follow a meta-learning pipeline and sample a subset of source domain training tasks during each training episode. However, the model is significantly different from conventional meta-learning methods in that: (1) no model updating is required for the target domain, (2) different training tasks share a memory bank for maintaining both scalability and discrimination ability, and (3) it can be used to match an arbitrary number of identities in a target domain. Extensive experiments on a newly proposed large-scale ReID domain generalization benchmark show that our DIMN significantly outperforms alternative domain generalization or meta-learning methods.
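A compressed, hypothetical illustration of the single-shot classifier-generation idea: a mapping network turns one gallery image's feature into an identity classifier weight, which is written into a memory bank so that probes can be scored against many identities without any model updating. The momentum update, dimensions, and module structure are assumptions, not the DIMN implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassifierMemory(nn.Module):
    """Toy sketch: map a single gallery feature to an identity classifier and
    store it in a memory bank that probe features are scored against."""

    def __init__(self, feat_dim=256, num_ids=1000, momentum=0.5):
        super().__init__()
        self.mapper = nn.Linear(feat_dim, feat_dim)      # feature -> classifier weight
        self.register_buffer('bank', torch.zeros(num_ids, feat_dim))
        self.momentum = momentum

    def write(self, gallery_feat, identity):
        # Single-shot classifier generation with a momentum update of the bank slot.
        w = F.normalize(self.mapper(gallery_feat), dim=-1)
        slot = self.bank[identity]
        self.bank[identity] = F.normalize(self.momentum * slot + (1 - self.momentum) * w, dim=-1)

    def score(self, probe_feat):
        # Probe logits against every stored identity classifier.
        return F.normalize(probe_feat, dim=-1) @ self.bank.t()

mem = ClassifierMemory()
with torch.no_grad():
    mem.write(torch.randn(256), identity=42)
logits = mem.score(torch.randn(4, 256))   # (4, 1000) scores over identities
```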
Deep Multi-task Attribute-driven Ranking for Fine-grained Sketch-based Image Retrieval
Fine-grained sketch-based image retrieval (SBIR) aims to go beyond conventional SBIR to perform instance-level cross-domain retrieval: finding the specific photo that matches an input sketch. Existing methods focus on designing/learning good features for cross-domain matching and/or learning cross-domain matching functions. However, they neglect the semantic aspect of retrieval, i.e., what meaningful object properties does a user try to encode in her/his sketch? We propose a fine-grained SBIR model that exploits semantic attributes and deep feature learning in a complementary way. Specifically, we perform multi-task deep learning with three objectives: retrieval by fine-grained ranking on a learned representation, attribute prediction, and attribute-level ranking. Simultaneously predicting semantic attributes and using such predictions in the ranking procedure helps retrieval results to be more semantically relevant. Importantly, the introduction of semantic attribute learning in the model allows for the elimination of the otherwise prohibitive cost of human annotations required for training a fine-grained deep ranking model. Experimental results demonstrate that our method outperforms the state-of-the-art on challenging fine-grained SBIR benchmarks while requiring less annotation.
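To make the three objectives concrete, here is a minimal sketch of a combined loss with a fine-grained triplet ranking term, an attribute prediction term, and an attribute-level ranking term over predicted attribute vectors; the loss weights, attribute encoding, and function signature are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def multitask_sbir_loss(feat_s, feat_p, feat_n,        # deep features: sketch, positive, negative
                        attr_logits, attr_labels,       # predicted / ground-truth binary attributes
                        attr_s, attr_p, attr_n,         # predicted attribute vectors per branch
                        margin=0.3, w_attr=0.5, w_attr_rank=0.5):
    # 1) Fine-grained ranking on the learned representation.
    rank = F.triplet_margin_loss(feat_s, feat_p, feat_n, margin=margin)
    # 2) Attribute prediction (multi-label binary attributes).
    attr_pred = F.binary_cross_entropy_with_logits(attr_logits, attr_labels)
    # 3) Attribute-level ranking: the true photo should be closer in attribute space.
    attr_rank = F.triplet_margin_loss(attr_s, attr_p, attr_n, margin=margin)
    return rank + w_attr * attr_pred + w_attr_rank * attr_rank

f = lambda *shape: torch.randn(*shape)
loss = multitask_sbir_loss(f(8, 512), f(8, 512), f(8, 512),
                           f(8, 40), torch.randint(0, 2, (8, 40)).float(),
                           f(8, 40), f(8, 40), f(8, 40))
```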
Deep Factorised Inverse-Sketching
Modelling human free-hand sketches has become topical recently, driven by practical applications such as fine-grained sketch-based image retrieval (FG-SBIR). Sketches are clearly related to photo edge-maps, but a human free-hand sketch of a photo is not simply a clean rendering of that photo's edge map. Instead there is a fundamental process of abstraction and iconic rendering, where overall geometry is warped and salient details are selectively included. In this paper we study this sketching process and attempt to invert it. We model this inversion by translating iconic free-hand sketches to contours that resemble more geometrically realistic projections of object boundaries, and separately factorise out the salient added details. This factorised re-representation makes it easier to match a free-hand sketch to a photo instance of an object. Specifically, we propose a novel unsupervised image style transfer model based on enforcing a cyclic embedding consistency constraint. A deep FG-SBIR model is then formulated to accommodate complementary discriminative detail from each factorised sketch for better matching with the corresponding photo. Our method is evaluated both qualitatively and quantitatively to demonstrate its superiority over a number of state-of-the-art alternatives for style transfer and FG-SBIR.
Comment: Accepted to ECCV 201
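The cyclic embedding consistency constraint can be illustrated in a few lines: translate a sketch to the contour domain and back, then penalise the round trip in a shared embedding space rather than in pixel space. The stand-in linear networks and the L1 penalty below are assumptions chosen for brevity, not the paper's actual generators.

```python
import torch
import torch.nn as nn

# Stand-in networks; real models would be convolutional encoders/generators.
embed = nn.Linear(1024, 128)          # shared embedding of flattened images
sketch2contour = nn.Linear(1024, 1024)
contour2sketch = nn.Linear(1024, 1024)

def cyclic_embedding_loss(sketch):
    contour = sketch2contour(sketch)   # translate to the contour domain
    recon = contour2sketch(contour)    # translate back to the sketch domain
    # Consistency is enforced between embeddings of the input and its round trip.
    return (embed(sketch) - embed(recon)).abs().mean()

loss = cyclic_embedding_loss(torch.randn(8, 1024))
loss.backward()
```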
Universal Sketch Perceptual Grouping
In this work we aim to develop a universal sketch grouper. That is, a grouper that can be applied to sketches of any category in any domain to group constituent strokes/segments into semantically meaningful object parts. The first obstacle to this goal is the lack of large-scale datasets with grouping annotation. To overcome this, we contribute the largest sketch perceptual grouping (SPG) dataset to date, consisting of 20,000 unique sketches evenly distributed over 25 object categories. Furthermore, we propose a novel deep universal perceptual grouping model. The model is learned with both generative and discriminative losses. The generative losses improve the generalisation ability of the model to unseen object categories and datasets. The discriminative losses include a local grouping loss and a novel global grouping loss to enforce global grouping consistency. We show that the proposed model significantly outperforms the state-of-the-art groupers. Further, we show that our grouper is useful for a number of sketch analysis tasks including sketch synthesis and fine-grained sketch-based image retrieval (FG-SBIR). © Springer Nature Switzerland AG 2018
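As an illustration of the two kinds of grouping losses, the sketch below scores a predicted stroke-pair affinity matrix with a local per-pair binary term against ground-truth groupings, plus a global term that penalises transitivity violations (if strokes i,j and j,k are grouped, then i,k should be too). The global term here is one plausible consistency formulation chosen for clarity, not necessarily the paper's exact global grouping loss.

```python
import torch
import torch.nn.functional as F

def grouping_losses(affinity_logits, same_group):
    """affinity_logits: (N, N) predicted stroke-pair scores.
    same_group: (N, N) binary ground truth (1 if two strokes share a part)."""
    # Local grouping loss: per-pair binary classification.
    local = F.binary_cross_entropy_with_logits(affinity_logits, same_group)
    # Global grouping consistency (illustrative): transitivity violation penalty.
    a = torch.sigmoid(affinity_logits)
    violation = F.relu(a.unsqueeze(2) * a.unsqueeze(0) - a.unsqueeze(1))  # a_ij * a_jk - a_ik
    return local + violation.mean()

n = 12
logits = torch.randn(n, n, requires_grad=True)
gt = (torch.rand(n, n) > 0.5).float()
grouping_losses(logits, gt).backward()
```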
On the Importance of Accurate Geometry Data for Dense 3D Vision Tasks
Learning-based methods to solve dense 3D vision problems typically train on 3D sensor data. Each measurement principle has its own advantages and drawbacks, which are typically not compared or discussed in the literature due to a lack of multi-modal datasets. Texture-less regions are problematic for structure from motion and stereo, reflective material poses issues for active sensing, and distances for translucent objects are intricate to measure with existing hardware. Training on inaccurate or corrupt data induces model bias and hampers generalisation capabilities. These effects remain unnoticed if the sensor measurement is considered as ground truth during the evaluation. This paper investigates the effect of sensor errors on the dense 3D vision tasks of depth estimation and reconstruction. We rigorously show the significant impact of sensor characteristics on the learned predictions and notice generalisation issues arising from various technologies in everyday household environments. For evaluation, we introduce a carefully designed dataset (available at https://github.com/Junggy/HAMMER-dataset) comprising measurements from commodity sensors, namely D-ToF, I-ToF, passive/active stereo, and monocular RGB+P. Our study quantifies the considerable sensor noise impact and paves the way to improved dense vision estimates and targeted data fusion.
Comment: Accepted at CVPR 2023, Main Paper + Supp. Mat. arXiv admin note: substantial text overlap with arXiv:2205.0456